In this case study, we’ll perform customer segmentation using k-means, a clustering algorithm.
Clustering is an unsupervised machine learning task that automatically divides data into clusters, or groupings of similar items, without being told ahead of time what the groups should look like. Since we may not even know what we’re looking for, clustering is used for knowledge discovery rather than prediction: it provides insight into the natural groupings found within the data.
K-means is the most widely used unsupervised machine learning algorithm. It essentially tries to minimize the distance between the points in a cluster and their centroid, which results in groups/segments whose characteristics are as different from each other as possible.
Segmentation offers many benefits in a business context. In particular, when a business develops a successful customer segmentation, each customer is assigned to a cluster instead of all customers being treated the same. The business can then apply a different strategy to each segment/cluster, which contributes to improving and optimizing key areas such as pricing, marketing, and retention.
Our company has 368 B2B customers, and we need to: i) create a centralized pricing policy for the different customer segments, and ii) improve customer satisfaction & retention through targeted marketing campaigns.
The stakeholders decided to develop a customer segmentation to:
All the data used for the analysis are anonymized.
This is a very important step in every data science project, especially when performing clustering.
We have to work alongside the stakeholders, as they will help us identify the important customer features and how to obtain them (via ERP, CRM systems, surveys, etc.). You may spend most of your time on this step in similar projects, as we need meaningful & important features to perform a successful segmentation.
At first, we load all the required libraries.
library(tidyverse)
library(ggthemes)
library(DT)
library(d3heatmap)
library(plotly)
library(ggfortify)
# Set the black & white theme for all plots
theme_set(theme_bw())
# Use it to prevent scientific notation
options(scipen = 999)
We’ll use the final dataset, which is the result of the initial ETL & feature-engineering step.
We use the read_csv() function (from the readr package) to read the CSV file into R.
customers <- read_csv(file = "data/customers.csv")
Parsed with column specification:
cols(
customer = col_character(),
area = col_character(),
certifications = col_double(),
last_year_revenue = col_double(),
new_products_prop = col_double(),
new_products_revenue = col_double(),
active_quarters = col_double(),
median_quarterly_dso = col_double(),
median_quarterly_balance = col_double(),
total_revenue = col_double(),
product_cat_A_revenue = col_double(),
product_cat_B_revenue = col_double(),
product_cat_C_revenue = col_double(),
product_cat_D_revenue = col_double()
)
customers %>% glimpse()
Rows: 368
Columns: 14
$ customer <chr> "Customer_1", "Customer_2", "Customer_3", "Customer_4", "Customer_5", "Cus…
$ area <chr> "Area_C", "Area_A", "Area_A", "Area_C", "Area_A", "Area_C", "Area_C", "Are…
$ certifications <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 3, 0, 0, 0, 6, 0, 9, 0, 1, 0, 5, 0, 1, 0, 0, 9,…
$ last_year_revenue <dbl> 14735.25, 1273.50, 8493.00, 8556.75, 18706.50, 9755.25, 446.25, 13469.25, …
$ new_products_prop <dbl> 33.66, 0.00, 27.90, 5.94, 3.51, 5.58, 15.84, 0.00, 18.36, 14.85, 0.00, 56.…
$ new_products_revenue <dbl> 5511.00, 0.00, 2636.25, 566.25, 721.50, 600.00, 78.75, 0.00, 195.75, 3820.…
$ active_quarters <dbl> 3, 2, 7, 54, 53, 3, 15, 3, 1, 57, 37, 4, 1, 38, 12, 19, 38, 54, 2, 4, 2, 3…
$ median_quarterly_dso <dbl> 82, 112, 123, 118, 82, 77, 82, 0, 0, 117, 82, 0, 82, 82, 0, 94, 108, 93, 0…
$ median_quarterly_balance <dbl> 1262.25, 594.15, 6261.10, 4063.85, 5218.15, 2476.90, 505.75, 10200.00, 0.0…
$ total_revenue <dbl> 2736.00, 735.00, 20482.50, 147867.75, 171960.00, 2983.50, 14721.75, 16469.…
$ product_cat_A_revenue <dbl> 7825.2375, 1273.8750, 616.1850, 4877.4375, 7594.0425, 9155.3850, 0.0000, 1…
$ product_cat_B_revenue <dbl> 5071.500, 0.000, 0.000, 225.000, 568.875, 0.000, 78.750, 0.000, 195.420, 2…
$ product_cat_C_revenue <dbl> 1170.000, 0.000, 0.000, 912.000, 2283.750, 0.000, 0.000, 0.000, 0.000, 226…
$ product_cat_D_revenue <dbl> 369.000, 0.000, 7756.642, 1365.007, 3317.393, 0.000, 367.500, 262.500, 0.0…
All variables are anonymized.
It’s very important to check for missing data before any analysis, especially when we are planning to perform clustering, because the k-means algorithm can’t handle missing data.
We haven’t got any missing values so we can proceed.
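The check itself isn’t shown above; a minimal base-R sketch (assuming the `customers` tibble loaded earlier) could look like this:

```r
# k-means can't handle NAs, so count the missing values per column
colSums(is.na(customers))

# Number of rows with at least one NA; if non-zero, those rows must be
# dropped or imputed before clustering
sum(!complete.cases(customers))
```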
The k-means algorithm can be applied only to continuous variables, and all the variables we’re using are continuous. Below we check the distribution of each variable.
We can see that most of our variables, especially the currency-related ones, are skewed towards large values, so it would be a good idea to use a logarithmic scale. In any case, prior to applying clustering we need to scale our variables, so it is worth creating some plots on a logarithmic scale as well.
customers %>%
  select(-customer, -area) %>%
  gather(key = "Variable", value = "Value") %>%
  ggplot(aes(log(Value))) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Variable, scales = "free") +
  labs(
    title = "Continuous variables histograms with log values",
    x = ""
  )
A lot of the log-transformed variables (almost all revenue-related vars) are close to a normal distribution (Log-normal).
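One caveat worth noting (an observation, not part of the original analysis): several revenue variables contain zeros, and `log(0)` is `-Inf`, so those rows silently vanish from log-scale histograms. `log1p()` is a safer transform in that case:

```r
# log(0) is -Inf, so zero-revenue customers drop out of a log histogram
log(0)

# log1p(x) = log(1 + x) maps zero to zero, keeping those customers visible
log1p(0)
```

Swapping `log(Value)` for `log1p(Value)` in the plotting code above would keep the zero values in the picture.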
A correlation plot is a chart that presents the correlations between the dataset variables.
library(corrplot)

# `exclude` holds the non-numeric columns to drop before computing
# correlations (inferred: customer & area are the only non-numeric vars)
exclude <- c("customer", "area")
corrplot(cor(select(customers, -all_of(exclude))), type = "upper", order = "hclust")
We can see that there are some strong correlations between the variables, especially among the currency-related ones: almost all currency variables seem to follow the same pattern.
A ridgeline plot essentially presents overlapping density charts, which makes it easy to compare all the distributions.
select(customers, -all_of(exclude)) %>%
  mutate_all(scale) %>%
  gather(key = "variable", value = "value") %>%
  mutate(value = scale(value)) %>%
  ggplot(aes(x = value, y = variable, fill = ..x..)) +
  ggridges::geom_density_ridges_gradient(rel_min_height = 0.0) +
  viridis::scale_fill_viridis(name = "") +
  ggridges::theme_ridges(font_size = 13, grid = TRUE) +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 12),
        text = element_text(family = "mono")) +
  scale_x_continuous(limits = c(-2, 2)) +
  labs(title = 'Density plots of variables',
       subtitle = 'All values are scaled')
Here we can see again that the distributions of all revenue-related variables are similar.
First we’ll determine the optimal number of clusters for our dataset. There are two common techniques for that: the elbow plot & the silhouette plot.
An elbow plot shows the total within-cluster sum of squares (the sum of squared Euclidean distances between each observation and the centroid of the cluster to which it is assigned) for different numbers of clusters.
The idea is to stop adding new segments when the additional reduction in total within-cluster sum of squares is no longer significant. In this case 4 or 5 clusters seems reasonable.
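The code behind the elbow values isn’t shown; a sketch along these lines would produce them (`scaled_customers` is an assumed name for the scaled numeric columns, and the seed is arbitrary since k-means starts are random):

```r
# Scale the numeric columns; k-means is distance-based
scaled_customers <- scale(select(customers, -customer, -area))

# Total within-cluster sum of squares for k = 1..10
set.seed(42)
wss <- sapply(1:10, function(k) {
  kmeans(scaled_customers, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Average silhouette width as a complementary check (cluster pkg ships with R)
sil <- sapply(2:10, function(k) {
  cl <- kmeans(scaled_customers, centers = k, nstart = 25)
  mean(cluster::silhouette(cl$cluster, dist(scaled_customers))[, 3])
})
```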
Finally we choose to perform the segmentation with 5 segments because:
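The `model_customers` object printed below isn’t defined in the shown code; a sketch of how it could be produced (variable names and seed are assumptions, and the exact cluster sizes depend on the random start):

```r
# k-means works on distances, so scale the numeric columns first
scaled_customers <- scale(select(customers, -customer, -area))

set.seed(123)  # arbitrary seed for reproducibility
model_customers <- kmeans(scaled_customers, centers = 5, nstart = 25)

# Attach the cluster label back to the data for later profiling
segment_customers <- customers %>%
  select(-customer, -area) %>%
  mutate(cluster = model_customers$cluster)
```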
model_customers
K-means clustering with 5 clusters of sizes 87, 27, 5, 106, 143
Cluster means:
certifications last_year_revenue new_products_prop new_products_revenue active_quarters
1 -0.24205640 -0.2151471 1.4021594 -0.05284781 -0.5972864
2 2.15836578 1.4977622 0.1062848 1.39054867 0.7585746
3 3.11953902 6.7245088 0.1467196 6.48816313 1.1675964
4 0.03982787 -0.0920378 -0.1739955 -0.14312797 1.0623485
5 -0.39885607 -0.3187994 -0.7492841 -0.35116297 -0.6081435
median_quarterly_dso median_quarterly_balance total_revenue product_cat_A_revenue product_cat_B_revenue
1 -0.2274370 -0.27283231 -0.33720357 -0.2438423 -0.07061636
2 0.7564557 1.77153804 1.56472299 1.6544661 1.58426684
3 0.9423862 6.06299794 6.17201374 6.1651859 5.91217991
4 0.5955087 -0.04405098 0.08484561 -0.0879380 -0.13168010
5 -0.4788331 -0.34783708 -0.36898262 -0.3144112 -0.36527545
product_cat_C_revenue product_cat_D_revenue
1 -0.28371813 -0.19453927
2 1.47054403 1.13789701
3 5.71348767 6.69482156
4 -0.01175898 -0.08857925
5 -0.29609929 -0.26491615
Clustering vector:
[1] 1 5 1 4 4 5 5 5 5 4 4 1 5 4 5 2 4 4 5 1 5 4 1 5 3 1 5 1 1 4 2 2 4 4 5 4 1 5 4 5 3 1 1 5 4 5 5 2 1 4 5 1
[53] 1 5 5 5 1 1 1 4 5 4 4 4 5 2 5 1 5 5 4 5 5 5 1 1 5 1 1 4 1 1 2 1 5 5 1 1 5 5 1 5 2 1 2 4 4 4 5 1 1 1 5 5
[105] 2 1 1 2 5 5 1 1 4 1 4 4 1 4 5 5 1 4 5 5 5 5 5 1 5 3 4 2 4 5 4 5 4 5 5 2 5 3 1 4 5 4 4 2 5 4 1 2 1 5 5 1
[157] 1 5 1 4 4 4 4 5 4 5 5 5 1 4 5 4 5 5 5 1 5 5 4 4 5 5 5 5 5 4 4 1 5 5 5 4 5 5 4 4 4 5 1 5 2 1 4 5 5 5 4 4
[209] 4 5 4 5 4 4 5 4 5 5 2 1 5 1 2 5 5 5 4 5 4 5 4 5 4 5 1 4 1 1 1 1 5 4 4 5 5 1 4 4 2 5 5 5 2 2 5 4 4 4 4 4
[261] 4 5 5 2 4 5 5 1 4 4 5 5 5 4 4 5 4 1 5 5 4 3 1 4 4 1 5 1 2 1 2 5 5 5 4 5 4 4 4 2 1 4 5 4 1 5 5 4 2 1 5 5
[313] 5 1 4 4 5 1 4 2 1 1 1 5 5 5 4 5 1 1 1 5 4 1 5 2 1 4 1 1 4 1 4 5 1 5 5 4 5 5 4 4 4 1 5 1 4 5 5 4 1 5 1 4
[365] 1 5 5 5
Within cluster sum of squares by cluster:
[1] 191.4711 354.9338 222.1625 489.0372 137.8794
(between_SS / total_SS = 68.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault"
This is an easy way to check the differences & similarities between segments. The center of a cluster is the average of all points (elements) that belong to that cluster; all values are scaled.
The greener the cell, the higher the value.
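The heatmap itself can be drawn directly from the model’s centers; a minimal base-R sketch (the d3heatmap package loaded above would give an interactive version of the same matrix):

```r
# model_customers$centers holds the scaled cluster means
# (one row per segment, one column per variable)
heatmap(model_customers$centers, scale = "none",
        main = "Cluster centers (scaled values)")
```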
Segment 3 contains just 5 customers. These are our “top” customers: they tend to generate far more revenue, hold a lot of certifications, purchase quite a big proportion of new products and have a high DSO.
Segment 2 contains 27 customers. They are the “close to the top” customers, as they tend to generate high revenue, maintain a high number of certifications, purchase quite a big proportion of new products and have a high DSO.
Segment 4 contains 106 customers. They are the “average” customers, as they tend to generate average revenue, hold a small number of certifications, purchase an average proportion of new products and have an average DSO.
Segment 1 contains 87 customers. They are the “promising” customers: they tend to generate relatively low revenue and hold few certifications, but they purchase a very high proportion of new products and have a small DSO.
Segment 5 contains 143 customers. They are the “under-performing” customers, as they tend to generate very low revenue, hold very few certifications, purchase a very low proportion of new products and have a small DSO.
Below is a table with information about all clusters (across all variables).
# Calculate the mean of each variable per cluster
segment_customers %>%
  group_by(cluster) %>%
  add_tally() %>%
  summarise(across(everything(), ~ round(mean(.x), 2))) %>%
  select(1, "n", everything()) %>%
  arrange(-last_year_revenue) %>%
  datatable(filter = 'top', options = list(pageLength = 5, autoWidth = TRUE, dom = 'pt'))
PCA is a dimensionality-reduction algorithm that summarizes the information content of a large dataset with a smaller set of features that can be more easily visualized and analyzed. It will assist us in visualizing the segmentation (biplot).
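`pca_customers`, summarized below, isn’t defined in the shown code; it can be produced with `prcomp` on the numeric columns (a sketch; scaling is essential here because the variables live on very different scales):

```r
# PCA on the numeric variables, centering and scaling each column
pca_customers <- customers %>%
  select(-customer, -area) %>%
  prcomp(center = TRUE, scale. = TRUE)
```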
summary(pca_customers)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
Standard deviation 2.7588 1.1009 1.03266 0.79920 0.7421 0.59581 0.48581 0.36453 0.33325 0.24355 0.12546
Proportion of Variance 0.6342 0.1010 0.08887 0.05323 0.0459 0.02958 0.01967 0.01107 0.00925 0.00494 0.00131
Cumulative Proportion 0.6342 0.7352 0.82409 0.87732 0.9232 0.95280 0.97246 0.98354 0.99279 0.99773 0.99905
PC12
Standard deviation 0.10701
Proportion of Variance 0.00095
Cumulative Proportion 1.00000
This plot shows all the original observations plotted on the first 2 principal components (which together contain about 74% of the original dataset’s variance). It also shows the original features mapped as vectors.
It shows that there are 3 distinct groups of variables:
- active_quarters & median_quarterly_dso
- new_products_prop
- all the revenue-related variables
ggplotly(
  pca_customers$x %>%
    as.data.frame() %>%
    mutate(CLIENT = customers$customer,
           Cluster = as.factor(segment_customers$cluster),
           revenue = segment_customers$last_year_revenue,
           newbusiness = segment_customers$new_products_prop,
           certifications = segment_customers$certifications) %>%
    select(CLIENT, Cluster, PC1, PC2, revenue, newbusiness, certifications) %>%
    ggplot(aes(PC1, PC2, color = Cluster,
               text = paste(CLIENT, "\n Revenue: ",
                            # scales::dollar(revenue, prefix = "€"),
                            revenue,
                            "\n New Prod. Rev.:", newbusiness, "%\nCertifications:",
                            certifications,
                            sep = ""))) +
    # `label` is not a stat_ellipse aesthetic, so it has been dropped
    stat_ellipse(aes(group = Cluster), type = "norm", level = 0.70) +
    geom_point() +
    theme_fivethirtyeight() +
    labs(title = "Business Biplot of customers",
         x = "",
         y = "") +
    scale_color_discrete(name = "Segments") +
    theme_hc(),
  tooltip = "text")
We can see here how “well” the segments are separated. For example:
- Segment 3 (“top” customers) is well separated from the rest.
- Segment 2 (“close to top” customers) sits just next to the top customers.
- Segments 1 & 5 are overlapping.
Almost all segments are clearly distinguished.
This table reports the segment & details of every customer.
The chosen “optimal” number of clusters is 5, which aligns with the business objectives of the company. It is crucial to work with the stakeholders while running the analysis in order to produce a useful output.
Segment | # customers | Description | Performance evaluation
3 | 5 | “Top” | Very high on all vars
2 | 27 | “Close to top” | High on all vars
4 | 106 | “Average” | Average on all vars
1 | 87 | “Promising” | Very high new products rev. & low on the rest
5 | 143 | “Under-performers” | Low on all vars
We showed that almost all revenue-related variables are correlated. We could remove some or most of these variables, as they don’t really help the segmentation process.
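One way to act on this (a base-R sketch; the 0.8 cutoff is an arbitrary illustration, not a value from the analysis):

```r
num_vars <- select(customers, -customer, -area)
cmat <- cor(num_vars)
cmat[lower.tri(cmat, diag = TRUE)] <- NA   # keep each pair only once
high <- which(abs(cmat) > 0.8, arr.ind = TRUE)

# Pairs of variables that are candidates for removal
data.frame(var1 = rownames(cmat)[high[, 1]],
           var2 = colnames(cmat)[high[, 2]],
           r    = cmat[high])
```

Dropping one variable from each flagged pair before clustering would keep the distance computations from being dominated by near-duplicate revenue variables.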
To improve our segmentation we need more features. We can meet with the stakeholders and work closely with them to find or generate more features.
We can also try other clustering algorithms, such as hierarchical clustering, DBSCAN, etc.